In this practical, I’ll walk through the complete process of importing data. It draws on what we covered in Week One as well as the pre-class reading for this week.
4.1 Part One: Working with CSV Data (1)
Step One: Creation of a Synthetic Dataset
Demonstration
Run the following code. As you run each line, try to think through what the code is doing.
Note how useful (or not) the comments are in understanding what the code is doing.
Note what happens in your Environment window when you run each line.
# Set seed for reproducibility
set.seed(42)

# Create synthetic data for an ice hockey dataset
player_id <- paste("Player", 1:100) # Player id
team_names <- sample(c("Team A", "Team B", "Team C", "Team D"), 100, replace = TRUE) # Teams
goals_scored <- sample(0:10, 100, replace = TRUE) # Goals scored by each player
assists <- sample(0:15, 100, replace = TRUE) # Assists by each player
penalty_minutes <- sample(0:20, 100, replace = TRUE) # Penalty minutes

# Combine into a data frame
hockey_data <- data.frame(
  Player = player_id,
  Team = team_names,
  Goals = goals_scored,
  Assists = assists,
  Penalty_Minutes = penalty_minutes
)

# Display the first few rows of the data
head(hockey_data)

    Player   Team Goals Assists Penalty_Minutes
1 Player 1 Team A     1       8               8
2 Player 2 Team A     7       0              10
3 Player 3 Team A     0      13               5
4 Player 4 Team A     1      12               9
5 Player 5 Team B     4      14               8
6 Player 6 Team D     7      13              18
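To see what set.seed() contributes, the short sketch below draws the same sample twice with the same seed, and then once more without resetting it. The object names (draw_a, draw_b, draw_c) are made up for illustration and are not part of the practical.

```r
# Same seed, same call: the two draws are identical
set.seed(42)
draw_a <- sample(0:10, 5, replace = TRUE)

set.seed(42)
draw_b <- sample(0:10, 5, replace = TRUE)

identical(draw_a, draw_b) # TRUE

# Without resetting the seed, the generator carries on from where it stopped,
# so this draw will usually differ from the first two
draw_c <- sample(0:10, 5, replace = TRUE)
draw_c
```

This is why everyone who runs the demonstration code with set.seed(42) should get the same synthetic dataset.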
Practice
Now create a new dataset called [hockey_data_02]. Change the team names, the range of goals scored, and at least one of the variable names.
Step Two: Saving the Data
Demonstration
In the following code, I save my dataframe ‘hockey_data’ as a .csv file. Notice where this file is stored (look in the ‘Files’ window).
# Save the dataset as a CSV file
write.csv(hockey_data, file = "hockey_data.csv", row.names = FALSE)
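If you are not sure where the file ended up, one way to check is sketched below. It relies on the fact that a file name without a folder, as used above, is resolved against the current working directory. The small demo_data frame and the file name demo_data.csv are stand-ins so that the sketch runs on its own.

```r
# Stand-in data so the sketch runs on its own (not the full hockey dataset)
demo_data <- data.frame(Player = c("Player 1", "Player 2"), Goals = c(1, 7))
write.csv(demo_data, file = "demo_data.csv", row.names = FALSE)

# A file name without a folder is resolved against the working directory
getwd()

# Check that the file exists there, and build its full path
file.exists("demo_data.csv")
full_path <- file.path(getwd(), "demo_data.csv")
full_path

# Tidy up the demo file
file.remove("demo_data.csv")
```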
Practice
Repeat the above, saving your dataframe ‘hockey_data_02’ as a .csv file.
Now, save the same dataframe to your University OneDrive folder.
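Saving to a different folder only requires a fuller path in the file argument. The sketch below is a rough illustration under one loud assumption: a temporary folder stands in for your OneDrive folder, because its real location depends on your computer and account. Replace target_dir with your own path. The column names in the stand-in data are also invented.

```r
# Stand-in data so the sketch runs on its own
hockey_data_02 <- data.frame(Skater = c("Skater 1", "Skater 2"), Goals = c(3, 12))

# ASSUMPTION: a temporary folder stands in for your OneDrive folder.
# Replace this with the real path on your machine.
target_dir <- file.path(tempdir(), "OneDrive_stand_in")
dir.create(target_dir, showWarnings = FALSE)

# file.path() joins the folder and file name with the correct separator
target_file <- file.path(target_dir, "hockey_data_02.csv")
write.csv(hockey_data_02, file = target_file, row.names = FALSE)

file.exists(target_file) # TRUE if the save worked
```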
Step Three: Importing the Data
Demonstration
First, I am going to clear my environment. You can do this with the broom icon in the Environment window, or use:
rm(list=ls()) # This clears the environment
# Import the dataset back into R
imported_data <- read.csv("hockey_data.csv")

# Display the imported data
head(imported_data)

    Player   Team Goals Assists Penalty_Minutes
1 Player 1 Team A     1       8               8
2 Player 2 Team A     7       0              10
3 Player 3 Team A     0      13               5
4 Player 4 Team A     1      12               9
5 Player 5 Team B     4      14               8
6 Player 6 Team D     7      13              18
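A quick way to confirm that nothing was lost in the round trip is to compare the data before and after. The sketch below does this on a small stand-in data frame written to a temporary file. The names small_data, tmp_file and reimported are illustrative only.

```r
# Small stand-in dataset
small_data <- data.frame(
  Player = c("Player 1", "Player 2", "Player 3"),
  Team = c("Team A", "Team B", "Team A"),
  Goals = c(1L, 7L, 0L)
)

# Write to a temporary file, then read it back
tmp_file <- tempfile(fileext = ".csv")
write.csv(small_data, file = tmp_file, row.names = FALSE)
reimported <- read.csv(tmp_file)

# Compare shape, column names, and contents
dim(reimported)
names(reimported)
all.equal(small_data, reimported)
```

all.equal() returns TRUE when the two data frames match, and otherwise a description of the differences it found.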
Practice
Repeat these steps for the ‘hockey_data_02’ file you saved earlier.
4.2 Part Two: Working with CSV Data (2)
Now, we’ll move on to explore the differences between importing and exporting CSV files in R with and without row names.
We’ll start by creating a synthetic dataset based on netball player statistics, and then demonstrate how to save this dataset to a CSV file both with and without row names.
Finally, we’ll show how to import these CSV files back into R.
Step 1: Create a Synthetic Dataset
Let’s begin by creating a synthetic dataset that contains information about netball players. This dataset will include columns for [Player], [Position], [Goals], and [Assists].
     Player     Position Goals Assists
1     Alice Goal Shooter    45      10
2     Bella  Wing Attack    30      25
3 Catherine  Goal Keeper     0       5
4     Diana       Centre    15      20
5     Emily Goal Defense     5      10
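The code that builds this dataset is not shown at this point, but the export steps that follow need an object called netball_data. A minimal sketch that reproduces the output above, using the same values that appear in the code in Step 5, would be:

```r
# Create a synthetic netball dataset (values taken from Step 5)
netball_data <- data.frame(
  Player = c("Alice", "Bella", "Catherine", "Diana", "Emily"),
  Position = c("Goal Shooter", "Wing Attack", "Goal Keeper", "Centre", "Goal Defense"),
  Goals = c(45, 30, 0, 15, 5),
  Assists = c(10, 25, 5, 20, 10)
)

# Display the dataset
print(netball_data)
```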
Step 2: Exporting the CSV File
Now, we’ll export this dataset to a CSV file. We’ll do this twice: once with row names and once without.
# Export with row names
write.csv(netball_data, "netball_with_rownames.csv", row.names = TRUE)
Check your working directory. You should see a CSV file named [netball_with_rownames.csv].
# Export without row names
write.csv(netball_data, "netball_without_rownames.csv", row.names = FALSE)
This creates a CSV file named [netball_without_rownames.csv] that does not store the row numbers as an extra first column.
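The difference between the two exports is easiest to see in the raw text of the files. The sketch below rebuilds a cut-down version of the dataset, writes both versions to temporary files (rather than the working directory, so that it can be run in isolation), and prints the first two lines of each with readLines(). The object names with_file and without_file are illustrative.

```r
# Cut-down stand-in for the netball dataset
netball_data <- data.frame(
  Player = c("Alice", "Bella", "Catherine"),
  Position = c("Goal Shooter", "Wing Attack", "Goal Keeper"),
  Goals = c(45, 30, 0),
  Assists = c(10, 25, 5)
)

with_file <- tempfile(fileext = ".csv")
without_file <- tempfile(fileext = ".csv")
write.csv(netball_data, with_file, row.names = TRUE)
write.csv(netball_data, without_file, row.names = FALSE)

# With row names: the header starts with an empty name ("") and every
# data line starts with its row name
with_lines <- readLines(with_file, n = 2)
with_lines

# Without row names: the first field on each line is the Player column
without_lines <- readLines(without_file, n = 2)
without_lines
```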
Step 3: Importing the CSV Files
Next, we’ll import these CSV files back into R to observe the differences.
# Import with row names
netball_with_rownames <- read.csv("netball_with_rownames.csv", row.names = 1)
print(netball_with_rownames)

     Player     Position Goals Assists
1     Alice Goal Shooter    45      10
2     Bella  Wing Attack    30      25
3 Catherine  Goal Keeper     0       5
4     Diana       Centre    15      20
5     Emily Goal Defense     5      10
Notice that, by specifying row.names = 1, we tell R to use the first column as row names. The imported data will appear similar to the original dataframe.
# Import without row names
netball_without_rownames <- read.csv("netball_without_rownames.csv")
print(netball_without_rownames)

     Player     Position Goals Assists
1     Alice Goal Shooter    45      10
2     Bella  Wing Attack    30      25
3 Catherine  Goal Keeper     0       5
4     Diana       Centre    15      20
5     Emily Goal Defense     5      10
Since we exported this file without row names, we don’t need to specify the row.names parameter. The data will be imported as is, with R automatically assigning default row numbers starting from 1.
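It can also help to see what happens if row.names = 1 is left out when reading a file that does contain row names. In the sketch below (temporary file, illustrative object names, cut-down data), the stored row names come back as an ordinary first column, which read.csv() names X because its header entry is empty.

```r
# Cut-down stand-in for the netball dataset
netball_data <- data.frame(
  Player = c("Alice", "Bella", "Catherine"),
  Position = c("Goal Shooter", "Wing Attack", "Goal Keeper"),
  Goals = c(45, 30, 0),
  Assists = c(10, 25, 5)
)

tmp_file <- tempfile(fileext = ".csv")
write.csv(netball_data, tmp_file, row.names = TRUE)

# Without row.names = 1, the row names become an extra column called "X"
no_arg <- read.csv(tmp_file)
names(no_arg)

# With row.names = 1, the first column is used as row names instead
with_arg <- read.csv(tmp_file, row.names = 1)
names(with_arg)
```

An unexpected X column after an import is therefore a useful hint that the file was exported with row names.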
Step 4: Comparing the Datasets
Let’s compare the datasets to see the difference:
# Compare the datasets
print("Dataset with Row Names:")

[1] "Dataset with Row Names:"

print(netball_with_rownames)

     Player     Position Goals Assists
1     Alice Goal Shooter    45      10
2     Bella  Wing Attack    30      25
3 Catherine  Goal Keeper     0       5
4     Diana       Centre    15      20
5     Emily Goal Defense     5      10

print("Dataset without Row Names:")

[1] "Dataset without Row Names:"

print(netball_without_rownames)

     Player     Position Goals Assists
1     Alice Goal Shooter    45      10
2     Bella  Wing Attack    30      25
3 Catherine  Goal Keeper     0       5
4     Diana       Centre    15      20
5     Emily Goal Defense     5      10
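In the first dataset, the row names that were written to the file at export have been read back in; in the second, R has automatically assigned new row numbers starting from 1. Because the original row names were themselves the default numbers 1 to 5, the two printed datasets look identical here.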
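Comparing printed output by eye is error-prone, so a programmatic check can be useful. The sketch below is one possible approach, not part of the original practical: it re-creates both imports from temporary files (with cut-down data and illustrative names) and compares column names, row names, and values.

```r
# Cut-down stand-in for the netball dataset
netball_data <- data.frame(
  Player = c("Alice", "Bella", "Catherine"),
  Position = c("Goal Shooter", "Wing Attack", "Goal Keeper"),
  Goals = c(45, 30, 0),
  Assists = c(10, 25, 5)
)

with_file <- tempfile(fileext = ".csv")
without_file <- tempfile(fileext = ".csv")
write.csv(netball_data, with_file, row.names = TRUE)
write.csv(netball_data, without_file, row.names = FALSE)

from_with <- read.csv(with_file, row.names = 1)
from_without <- read.csv(without_file)

# Same columns?
identical(names(from_with), names(from_without))

# Same row names (compared as text)?
identical(rownames(from_with), rownames(from_without))

# Same values, ignoring attributes such as how row names are stored?
all.equal(from_with, from_without, check.attributes = FALSE)
```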
Step 5: Adding Missing Row Names
If your data doesn’t have meaningful row names, you might want to add them. Row names are most useful when they carry real meaning, such as player names or unique identifiers, because they then let you identify rows in your dataset directly.
# Create a synthetic netball dataset
netball_data <- data.frame(
  Player = c("Alice", "Bella", "Catherine", "Diana", "Emily"),
  Position = c("Goal Shooter", "Wing Attack", "Goal Keeper", "Centre", "Goal Defense"),
  Goals = c(45, 30, 0, 15, 5),
  Assists = c(10, 25, 5, 20, 10)
)

# Assign player names as row names
row.names(netball_data) <- netball_data$Player

# Remove the Player column
netball_data <- netball_data[ , -1]

# View the updated dataset
print(netball_data)

              Position Goals Assists
Alice     Goal Shooter    45      10
Bella      Wing Attack    30      25
Catherine  Goal Keeper     0       5
Diana           Centre    15      20
Emily     Goal Defense     5      10

# Export with row names
write.csv(netball_data, "netball_with_custom_rownames.csv", row.names = TRUE)
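Once row names carry meaning, they can be used to look up rows directly. As a hedged illustration, the sketch below writes a cut-down dataset with player names as row names to a temporary file, reads it back with row.names = 1, and selects values by name. The object names tmp_file and reimported are illustrative.

```r
# Cut-down dataset with player names as row names
netball_data <- data.frame(
  Position = c("Goal Shooter", "Wing Attack", "Goal Keeper"),
  Goals = c(45, 30, 0),
  row.names = c("Alice", "Bella", "Catherine")
)

tmp_file <- tempfile(fileext = ".csv")
write.csv(netball_data, tmp_file, row.names = TRUE)

# Read the file back, using the first column as row names
reimported <- read.csv(tmp_file, row.names = 1)
rownames(reimported)

# Look up a single value, or a whole row, by player name
reimported["Bella", "Goals"] # 30
reimported["Alice", ]
```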
Setting Variable Types
Examine the dataset (which function/s can you use?)
Are all the variables interpreted correctly when R imports the CSV file?
If not, what do you need to do to correct this?
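As a starting point for these questions, the sketch below shows one possible workflow, not the only answer. It inspects the imported types with str(), then converts a text column to a factor, either after import or at import time through the colClasses argument of read.csv(). The data are a cut-down stand-in, and treating Position as a factor is an assumption about what "correct" means for this dataset.

```r
# Cut-down stand-in dataset, written to a temporary file
netball_data <- data.frame(
  Player = c("Alice", "Bella", "Catherine"),
  Position = c("Goal Shooter", "Wing Attack", "Goal Keeper"),
  Goals = c(45, 30, 0)
)
tmp_file <- tempfile(fileext = ".csv")
write.csv(netball_data, tmp_file, row.names = FALSE)

# Inspect how R interpreted each column
imported <- read.csv(tmp_file)
str(imported)
sapply(imported, class)

# Option 1: convert a column after import
imported$Position <- as.factor(imported$Position)

# Option 2: declare column types while importing
typed <- read.csv(tmp_file, colClasses = c(Position = "factor", Goals = "integer"))
sapply(typed, class)
```

Other functions such as summary() and class() can also be used for the inspection step.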
4.3 Practical Activity
Using the code provided above, create your own synthetic dataset.
Save the dataset as a .csv file.
Clear your environment, then import your dataset back into the environment as a dataframe.